821 research outputs found

A critical look at studies applying over-sampling on the TPEHGDB dataset

Author: A García-Blanco
A Smrdel
AJ Hussain
AL Goldberger
DA Silva De
G Fele-Žorž
H Watson
J Ryu
K Subramaniam
L Liu
LJ Meertens
M Shahrdad
MU Ahmed
N Sadi-Ahmed
NV Chawla
P Fergus
P Fergus
P Fergus
P Ren
S Sim
SM Naeem
UR Acharya
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2019

Preterm birth is the leading cause of death among young children and is highly prevalent globally. Machine learning models, based on features extracted from clinical sources such as electronic patient files, yield promising results. In this study, we review similar studies that constructed predictive models based on a publicly available dataset, called the Term-Preterm EHG Database (TPEHGDB), which contains electrohysterogram signals on top of clinical data. These studies often report near-perfect prediction results, by applying over-sampling as a means of data augmentation. We reconstruct these results to show that they can only be achieved when data augmentation is applied on the entire dataset prior to partitioning into training and testing sets. This results in (i) artificial samples that are highly correlated to data points from the test set being added to the training set, and (ii) artificial samples that are highly correlated to points from the training set being added to the test set. Many previously reported results therefore carry little meaning in terms of the actual effectiveness of the model in making predictions on unseen data in a real-world setting. After focusing on the danger of applying over-sampling strategies before data partitioning, we present a realistic baseline for the TPEHGDB dataset and show how the predictive performance and clinical use can be improved by incorporating features from electrohysterogram sensors and by applying over-sampling on the training set.
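
The ordering issue described in this abstract can be illustrated with a minimal sketch. It assumes plain random over-sampling (duplicating minority samples), which is not necessarily the technique used in the reviewed studies, and the function name `split_then_oversample` and the toy data are hypothetical. The point is only that the test indices are fixed before any sample is duplicated, so no copy of a test point can reach the training set.

```python
import random
from collections import Counter

def split_then_oversample(labels, test_fraction=0.25, seed=0):
    """Partition first, then over-sample the minority class in the training part only."""
    rng = random.Random(seed)
    indices = list(range(len(labels)))
    rng.shuffle(indices)
    n_test = int(len(indices) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    # Group the training indices by class label.
    by_class = {}
    for i in train_idx:
        by_class.setdefault(labels[i], []).append(i)

    # Duplicate randomly chosen members of each smaller class
    # until every class is as large as the largest one.
    target = max(len(members) for members in by_class.values())
    balanced_train = []
    for members in by_class.values():
        extra = [rng.choice(members) for _ in range(target - len(members))]
        balanced_train.extend(members + extra)

    return balanced_train, test_idx

# Toy data (hypothetical): 8 minority samples (label 1) out of 40.
labels = [1 if i < 8 else 0 for i in range(40)]
train_idx, test_idx = split_then_oversample(labels)
overlap = set(train_idx) & set(test_idx)
train_counts = Counter(labels[i] for i in train_idx)
```

Reversing the two steps (over-sampling all 40 samples and shuffling afterwards) would let duplicates of the same original sample land on both sides of the split, which is the leakage the abstract warns about.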

Crossref

Ghent University Academic Bibliography

On the suitability of resampling techniques for the class imbalance problem in credit scoring

Author: A I Marqués
Abrahams CR
Chawla NV
Demšar J
Hochberg Y
J S Sánchez
Japkowicz N
Pluto K
Thomas LC
V García
Vinciotti V
Yen S-J
Zar JH
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2013

In real-life credit scoring applications, the class of defaulters is very commonly under-represented in comparison with the class of non-defaulters, yet this situation has received little attention. The present paper investigates the suitability and performance of several resampling techniques when applied in conjunction with statistical and artificial intelligence prediction models over five real-world credit data sets, which have been artificially modified to derive different imbalance ratios (proportion of defaulter and non-defaulter examples). Experimental results demonstrate that the use of resampling methods consistently improves on the performance obtained with the original imbalanced data. It is also important to note that, in general, over-sampling techniques perform better than any under-sampling approach. This work has partially been supported by the Spanish Ministry of Education and Science under grant TIN2009-14205 and the Generalitat Valenciana under grant PROMETEO/2010/028.

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Crossref

Repositori Institucional de la Universitat Jaume I

A swarm intelligence approach in undersampling majority class

Author: A McCluskey
E Keogh
G Feng
GE Batista
H Han
HA Elsalamony
J Bishop
M Beckmann
MM Rifaie al
NV Chawla
NV Chawla
S Moro
V García
Y Sun
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 27/08/2016

Over the years, machine learning has faced the issue of imbalanced datasets, which occur when the number of instances in one class significantly outnumbers the instances in the other class. This study investigates a new approach for balancing the dataset using a swarm intelligence technique, Stochastic Diffusion Search (SDS), to undersample the majority class on a direct marketing dataset. The outcome of the novel application of this swarm intelligence algorithm demonstrates promising results, which encourage the possibility of undersampling a majority class by removing redundant data whilst protecting the useful data in the dataset. This paper details the behaviour of the proposed algorithm in dealing with this problem and investigates the results, which are contrasted against other techniques.
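
For orientation, the simplest form of majority-class undersampling can be sketched as below. This is plain random undersampling, shown only as a baseline of the kind such methods are usually compared against; it is not the Stochastic Diffusion Search procedure of the paper, and the function name and toy labels are hypothetical.

```python
import random
from collections import Counter

def random_undersample(labels, seed=0):
    """Keep every instance of the smallest class and an equally sized
    random subset of each larger class; return the kept indices."""
    rng = random.Random(seed)
    by_class = {}
    for i, label in enumerate(labels):
        by_class.setdefault(label, []).append(i)

    target = min(len(members) for members in by_class.values())
    kept = []
    for members in by_class.values():
        # sample() draws without replacement, so no instance is duplicated.
        kept.extend(rng.sample(members, target))
    return sorted(kept)

# Toy labels (hypothetical): 90 majority "no" and 10 minority "yes" instances.
labels = ["no"] * 90 + ["yes"] * 10
kept = random_undersample(labels)
kept_counts = Counter(labels[i] for i in kept)
```

Random selection discards majority instances blindly, whereas the abstract describes an approach that aims to remove redundant data while keeping the useful instances.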

Goldsmiths Research Online

Crossref

Greenwich Academic Literature Archive

Maximal regularity for non-autonomous equations with measurable dependence on time

In this paper we study maximal $L^p$-regularity for evolution equations with time-dependent operators $A$. We merely assume a measurable dependence on time. In the first part of the paper we present a new sufficient condition for the $L^p$-boundedness of a class of vector-valued singular integrals which does not rely on Hörmander conditions in the time variable. This is then used to develop an abstract operator-theoretic approach to maximal regularity. The results are applied to the case of $m$-th order elliptic operators $A$ with time and space-dependent coefficients. Here the highest order coefficients are assumed to be measurable in time and continuous in the space variables. This results in an $L^p(L^q)$-theory for such equations for $p,q\in (1, \infty)$. In the final section we extend a well-posedness result for quasilinear equations to the time-dependent setting. Here we give an example of a nonlinear parabolic PDE to which the result can be applied. Comment: Application to a quasilinear equation added. Accepted for publication in Potential Analysis

arXiv.org e-Print Archive

Crossref

TU Delft Repository

Springer - Publisher Connector

Repositorium für Naturwissenschaften und Technik

A prevalent mutation with founder effect in Spanish Recessive Dystrophic Epidermolysis Bullosa families

Author: A Hovnanian
AM Christiano
Carmen Ayuso
Carolina Sánchez-Jimeno
JD Fine
JS Kern
JS Kern
M Csikós
Marcela Del Rio
Marta García
María-José Escámez
María-José Trujillo-Tiebas
MJ Escámez
Natividad Cuadrado-Corrales
Nuria Illera
NV Whittock
RJ Morell
T Strachan
Ángela Hernández-Martín
Publication venue: BioMed Central
Publication date: 01/01/2010

Background: Recessive Dystrophic Epidermolysis Bullosa (RDEB) is a genodermatosis caused by more than 500 different mutations in the COL7A1 gene and characterized by blistering of the skin following minimal friction or mechanical trauma. The identification of a cluster of RDEB pedigrees carrying the c.6527insC mutation in a specific area raises the question of whether this mutation originated from a common ancestor or is the result of a hotspot mutation. The aim of this study was to investigate the origin of the c.6527insC mutation. Methods: Haplotypes were constructed by genotyping nine single nucleotide polymorphisms (SNPs) throughout the COL7A1 gene. Haplotypes were determined in RDEB patients and control samples, both of Spanish origin. Results: Sixteen different haplotypes were identified in our study. A single haplotype cosegregated with the c.6527insC mutation. Conclusion: Haplotype analysis showed that all alleles carrying the c.6527insC mutation shared the same haplotype cosegregating with this mutation (CCGCTCAAA_6527insC), thus suggesting the presence of a common ancestor.

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

A novel homozygous mutation in the solute carrier family 12 member 3 gene in a Chinese family with Gitelman syndrome

Author: Cruz DN
Fremont OT
Galli-Tsinopoulou A
Gamba G
García-Martín Antonia
Gitelman HJ
Glaudemans B
Graziani G
Knoers NV
Li C
Lin SH
Lü Q
Nakhoul F
Oguz A
Riveira-Munoz E
Riveira-Munoz E
Shao LP
Simon DB
Tago N
Tao H
Tseng MH
Vargas-Poussou R
Verlander JW
Zelikovic I
Publication venue: 'FapUNIFESP (SciELO)'
Publication date: 01/01/2016

Crossref

Molecular Approach to the Identification of Fish in the South China Sea

Author: A Bucklin
A Valentini
BC Victor
CL Dudgeon
CP Ornelas-García
D Bensasson
D Rubinoff
D Steinke
D Steinke
DM Lambert
ER Swartza
F Teletchea
G Bernardi
G Bernardi
G Comi
G Piganeau
GG Pegg
H Song
IN Sarkar
Indra Neil Sarkar
J Liu
J Neigel
J Rock
JB Zhang
JH Xiao
JJ Doyle
JS Sparks
JTF Chen
Junbin Zhang
JW Tuckey
K Lucy
L Frézal
LA Ruedas
LL Wong
M Barbuto
M Hajibabaei
M Kimura
M Kochzius
M Nei
ME Hellberg
MH Greenstone
N Hubert
NV Ivanova
NV Ivanova
NV Parin
P Chakrabarty
PDN Hebert
PDN Hebert
PDN Hebert
PH Barber
R DeSalle
R Hanner
RD Ward
RD Ward
Robert Hanner
RS Rasmussen
S Kumar
S Planes
S Ratnasingham
S Santos
SC Shen
SF Chenoweth
SJ Scheffer
SR Palumbi
T Ekrem
ZX Cui
Publication venue: Public Library of Science
Publication date: 17/02/2012

BACKGROUND: DNA barcoding is one means of establishing a rapid, accurate, and cost-effective system for the identification of species. It involves the use of short, standard gene targets to create sequence profiles of known species against which sequences of unknowns can be matched and subsequently identified. The Fish Barcode of Life (FISH-BOL) campaign has the primary goal of gathering DNA barcode records for all the world's fish species. As a contribution to FISH-BOL, we examined the degree to which DNA barcoding can discriminate marine fishes from the South China Sea. METHODOLOGY/PRINCIPAL FINDINGS: DNA barcodes of cytochrome oxidase subunit I (COI) were characterized using 1336 specimens that belong to 242 fish species from the South China Sea. All specimen provenance data (including digital specimen images and geospatial coordinates of collection localities) and collateral sequence information were assembled using the Barcode of Life Data System (BOLD; www.barcodinglife.org). Small intraspecific and large interspecific differences create distinct genetic boundaries among most species. In addition, the efficiency of two mitochondrial genes, 16S rRNA (16S) and cytochrome b (cytb), and one nuclear ribosomal gene, 18S rRNA (18S), was also evaluated for a few select groups of species. CONCLUSIONS/SIGNIFICANCE: The present study provides evidence for the effectiveness of DNA barcoding as a tool for monitoring marine biodiversity. Open access data of fishes from the South China Sea can benefit related applications in ecology and taxonomy.

Crossref

Directory of Open Access Journals

PubMed Central

FigShare

Monitoring an Alien Invasion: DNA Barcoding and the Identification of Lionfish and Their Prey on Coral Reefs of the Mexican Caribbean

BACKGROUND: In the Mexican Caribbean, the exotic lionfish Pterois volitans has become a species of great concern because of its predatory habits and rapid expansion onto the Mesoamerican coral reef, the second largest continuous reef system in the world. This is the first report of DNA identification of stomach contents of lionfish using the Barcode of Life reference database (BOLD). METHODOLOGY/PRINCIPAL FINDINGS: We confirm with barcoding that apparently only Pterois volitans is present in the Mexican Caribbean. We analyzed the stomach contents of 157 specimens of P. volitans from various locations in the region. Based on DNA matches in the Barcode of Life Database (BOLD) and GenBank, we identified fishes from five orders, 14 families, 22 genera and 34 species in the stomach contents. The families with the most species represented were Gobiidae and Apogonidae. Some prey taxa are commercially important species. Seven species were new records for the Mexican Caribbean: Apogon mosavi, Coryphopterus venezuelae, C. thrix, C. tortugae, Lythrypnus minimus, Starksia langi and S. ocellata. DNA matches, as well as the presence of intact lionfish in the stomach contents, indicate some degree of cannibalism, a behavior confirmed in this species for the first time. We obtained 45 distinct crustacean prey sequences, of which only 20 taxa could be identified from the BOLD and GenBank databases. The matches were primarily to Decapoda, but only a single taxon could be identified to the species level, Euphausia americana. CONCLUSIONS/SIGNIFICANCE: This technique proved to be an efficient and useful method, especially since prey species could be identified from partially digested remains. The primary limitation is the lack of comprehensive coverage of potential prey species in the region in the BOLD and GenBank databases, especially among invertebrates.

Public Library of Science (PLOS)

Crossref

Directory of Open Access Journals

PubMed Central

FigShare

An insight into imbalanced Big Data classification: outcomes and challenges

Author: A Fernández
A Fernández
A Thusoo
B Krawczyk
C Bunkhumpornpat
CP Chen
D Lyubimov
E Elsebakhi
E Ramentol
F Hu
F Hu
G Haixiang
GEAPA Batista
GM Weiss
H He
H Yu
I Triguero
I Triguero
J Alcalá-Fdez
J Dean
J Huang
J Li
JA Sáez
JM Tomczak
K Kambatla
L Rokach
M Galar
M Galar
M Wasikowski
NV Chawla
NV Chawla
PC Zikopoulos
R Baeza-Yates
R Barandela
R Blagus
RC Prati
S Alshomrani
S Barua
S Elhag
S Kamal
S Owen
S Río
S Río
S-H Park
T Jo
T White
V García
V López
V López
V López
X Meng
X Wu
Y Guo
Y Sun
Y-S Chen
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2017

Big Data applications have been emerging in recent years, and researchers from many disciplines are aware of the considerable advantages of extracting knowledge from this type of problem. However, traditional learning approaches cannot be directly applied due to scalability issues. To overcome this issue, the MapReduce framework has arisen as a “de facto” solution. Basically, it carries out a “divide-and-conquer” distributed procedure in a fault-tolerant way suited to commodity hardware. As this is still a recent discipline, little research has been conducted on imbalanced classification for Big Data. The reasons behind this are mainly the difficulties in adapting standard techniques to the MapReduce programming style. Additionally, intrinsic problems of imbalanced data, namely lack of data and small disjuncts, are accentuated during the data partitioning needed to fit the MapReduce programming style. This paper is built on three main pillars. First, to present the first outcomes for imbalanced classification in Big Data problems, introducing the current research state of this area. Second, to analyze the behavior of standard pre-processing techniques in this particular framework. Finally, taking into account the experimental results obtained throughout this work, we will carry out a discussion on the challenges and future directions for the topic. This work has been partially supported by the Spanish Ministry of Science and Technology under Projects TIN2014-57251-P and TIN2015-68454-R, the Andalusian Research Plan P11-TIC-7765, the Foundation BBVA Project 75/2016 BigDaPTOOLS, and the National Science Foundation (NSF) Grant IIS-1447795.
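
The remark that partitioning accentuates the lack of minority data can be made concrete with a small single-process sketch. It only mimics the map and reduce steps in plain Python and does not use any real MapReduce framework; the round-robin partitioning, the function names and the toy class sizes are all assumptions chosen for illustration.

```python
from collections import Counter

def partition(labels, n_partitions):
    """Round-robin split, standing in for the chunks handed to each mapper."""
    return [labels[i::n_partitions] for i in range(n_partitions)]

def map_count(chunk):
    """Map step: each mapper only sees the class counts of its own chunk."""
    return Counter(chunk)

def reduce_counts(partials):
    """Reduce step: merge the per-chunk counts into global counts."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

# Toy data (hypothetical): 10 minority instances among 1000.
labels = ["minority"] * 10 + ["majority"] * 990
partials = [map_count(chunk) for chunk in partition(labels, 8)]
total = reduce_counts(partials)
minority_per_chunk = [partial["minority"] for partial in partials]
```

Although the global counts are recovered exactly after the reduce step, each mapper sees at most two minority instances here, so any pre-processing or learning carried out inside a single chunk has very little minority data to work with.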

Crossref

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Springer - Publisher Connector

Repositorio Institucional Universidad de Granada

Diet and food strategies in a southern al-Andalusian urban environment during Caliphal period, Écija, Sevilla

Author: A García Baena
A Martinez-Cortizas
AM Watson
Andrea Waters-Rist
BF Reilly
BS Chisholm
BT Fuller
BT Fuller
C Melville
C Sirignano
CS Larsen
DH Ubelaker
DJ Reid
E Garcia
E García Sánchez
E Prevedorou
Gina Carroll
GJ Klinken Van
H Bocherens
H Bocherens
HM Liversidge
HW Krueger
I Al-Oumaoui
I Grau-Sologestoa
I Guede
J Kaal
J Lee-Thorp
J Lee-Thorp
J Perez Vidal
J Salas-Salvado
JA Quirós Castillo
JC Carvajal Lopez
JC Sealy
JC Vogel
JL Lee-Thorp
JM Martín Civantos
K Britton
LL Tieszen
M García Garcia
M García García
M Minigawa
M Mundee
M Oliva
M Shatzmiller
MA Katzenberg
ME Díaz-Jorge
MG Lee-Thorp
MH O'Leary
MJ Collins
MJ DeNiro
MJ DeNiro
MJ DeNiro
MJ DeNiro
MJ Kohn
MJ Kohn
MJ Schoeninger
MJ Schoeninger
MM Alexander
MR Menocal
N Alonso
NJ Merwe van der
NV Passalacqua
O López-Costas
O López-Costas
O Nehlich
Olalla López-Costas
P Chalmeta
P Flohr
P Guichard
P Hernández Iñigo
R Longin
REM Hedges
S García-Dils de la Vegas
S Imamuddin
SA Inskip
SA Inskip
Sarah Inskip
SH Ambrose
SH Ambrose
SH Ambrose
SJM Davis
SR Zakrzewski
T Glick
T Sato
TC O'Connell
TF Glick
THE Heaton
WS Broecker
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2018

The Iberian medieval period is unique in European history due to the widespread socio-cultural changes that took place after the arrival of Arabs, Berbers and Islam in 711 AD. Recently, isotopic research has provided insight into dietary shifts, status, resource availability and the impact of environment. However, there is no published isotopic research exploring these factors in southern Iberian populations, and as the history of this area differs from that of the northern regions, this leaves a significant lacuna in our knowledge. This research fills this gap via isotopic analysis of human (n = 66) and faunal (n = 13) samples from 9th to 13th century Écija, a town renowned for high temperatures and salinity. Stable carbon (δ13C) and nitrogen (δ15N) isotopes were assessed from rib collagen, while carbon (δ13C) values were derived from enamel apatite. Human diet is consistent with C3 plant consumption with a very minor contribution of C4 plants, an interesting feature considering the suitability of Écija for C4 cereal production. δ15N values vary among adults, which may suggest variable animal protein consumption or isotopic variation within animal species due to differences in foddering. Consideration of δ13C collagen and apatite values together may indicate sugarcane consumption, while moderate δ15N values do not suggest a strong aridity or salinity effect. Comparison with other Iberian groups shows similarities relating to time and location rather than to religion, although more multi-isotopic studies combined with zooarchaeology and botany may reveal subtle differences unobservable in carbon and nitrogen collagen studies alone. OLC is funded by Plan Galego I2C mod.B (ED481D 2017/014). The research was partially funded by the projects “Galician Paleodiet” and by Consiliencia network (ED 431D2017/08) Xunta de Galicia

Crossref

Repositorio Institucional da Universidade de Santiago de Compostela